Interactive¶
applymap/apply/map
value_counts
list comprehension hconcat
import pandas as pd
import altair as alt
import numpy as np
Example with data from Spotify¶
Here we use the spotify_dataset.csv file from Canvas. The dataset originally came from Kaggle, and the Kaggle page includes a description of the columns.
We perform some “cleaning” of the dataset. By the end of Math 10, everything in the following cell should be understandable, but for now you shouldn’t worry about the details of this “cleaning”.
Important: You may need to change the path from data/spotify_dataset.csv, depending on where you have this csv file stored.
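df = pd.read_csv("data/spotify_dataset.csv") # change path if necessary
# Treat blank entries as missing values
df = df.replace(" ",np.nan)
# Remove the thousands separators so that Streams can be converted to numbers
df["Streams"] = df["Streams"].str.replace(",","")
# Convert the numeric columns, selected by position, from strings to numbers
df.iloc[:,[5,7]] = df.iloc[:,[5,7]].apply(pd.to_numeric,axis=0).copy()
df.iloc[:,12:22] = df.iloc[:,12:22].apply(pd.to_numeric,axis=0).copy()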
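To see what these cleaning steps do, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are invented for illustration and do not come from the Spotify file. Blank entries become missing values, the thousands separators are removed, and the strings are converted to numbers.

```python
import numpy as np
import pandas as pd

# Made-up data imitating the raw csv: numbers stored as strings, blanks as " "
toy = pd.DataFrame({
    "Streams": ["1,234", "56,789", " "],
    "Energy": ["0.5", " ", "0.7"],
})

toy = toy.replace(" ", np.nan)                        # blank strings -> missing values
toy["Streams"] = toy["Streams"].str.replace(",", "")  # "1,234" -> "1234"
toy = toy.apply(pd.to_numeric, axis=0)                # convert each column to numbers

print(toy.dtypes)
print(toy)
```

After the conversion, both columns hold numbers, and the blank entries show up as NaN.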
Scatter plot¶
The following Altair chart is just like what we made above with our random DataFrame. We again use the column names to specify which parts of the data to use. Before, we used column names like “a” and “b”. Here the column names are more descriptive, like “Energy” and “Valence”.
df = df[df["Chord"].notna()].copy()
chords = sorted(list(set(df["Chord"])))
chords
['A',
'A#/Bb',
'B',
'C',
'C#/Db',
'D',
'D#/Eb',
'E',
'F',
'F#/Gb',
'G',
'G#/Ab']
df["Chord"].value_counts().max()
214
df["Natural"] = df["Chord"].map(lambda x: 1 if len(x) == 1 else 0)
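The map call above applies a function to each entry of a Series, one entry at a time. As a small sketch, here is the same lambda function applied to a made-up Series of chord names (the Series `s` is invented for illustration). A chord name of length 1, like "A", has no sharp or flat, so it gets the value 1.

```python
import pandas as pd

# Made-up Series of chord names, chosen for illustration
s = pd.Series(["A", "A#/Bb", "C", "F#/Gb"])

# map applies the function to each entry of the Series, one at a time
natural = s.map(lambda x: 1 if len(x) == 1 else 0)

print(natural.tolist())  # [1, 0, 1, 0]
```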
brush = alt.selection_interval(empty='none')
chart1 = alt.Chart(df).mark_circle().encode(
x = "Energy",
y = "Valence",
color = 'Chord',
tooltip = ["Artist","Song Name","Release Date","Chord"]
).add_selection(
brush,
)
chart2 = alt.Chart(df).mark_bar().encode(
x = alt.X("Chord",scale=alt.Scale(domain=chords)),
y = alt.Y("count()",scale=alt.Scale(domain=[0,220])),
color="Chord",
).transform_filter(
brush,
)
chart1 | chart2
brush = alt.selection_single(empty='none',fields=["Chord"],on='mouseover')
chart1 = alt.Chart(df).mark_circle().encode(
x = alt.X("Energy",scale=alt.Scale(domain=[0,1])),
y = alt.Y("Valence",scale=alt.Scale(domain=[0,1])),
color = 'Chord',
tooltip = ["Artist","Song Name","Release Date","Chord"]
).transform_filter(
brush
)
chart2 = alt.Chart(df).mark_bar().encode(
x = alt.X("Chord",scale=alt.Scale(domain=chords)),
y = alt.Y("count()",scale=alt.Scale(domain=[0,220])),
color="Chord",
).add_selection(
brush,
)
chart1 | chart2
brush = alt.selection_multi(empty='none',fields=["Chord"],on='click')
chart1 = alt.Chart(df).mark_circle().encode(
x = alt.X("Energy",scale=alt.Scale(domain=[0,1])),
y = alt.Y("Valence",scale=alt.Scale(domain=[0,1])),
color = 'Chord',
tooltip = ["Artist","Song Name","Release Date","Chord"]
).transform_filter(
brush
)
chart2 = alt.Chart(df).mark_bar().encode(
x = alt.X("Chord",scale=alt.Scale(domain=chords)),
y = alt.Y("count()",scale=alt.Scale(domain=[0,220])),
color="Chord",
).add_selection(
brush,
)
chart1 | chart2
One of my favorite customizations in Altair is to use a more interesting color scheme. Here is an example using the color scheme “goldred”. You can find more color options in the Vega documentation.
alt.Chart(df).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred")),
tooltip = ["Artist","Song Name","Release Date","Chord"]
)
Sometimes the colors look more natural if they are reversed. We do that by adding reverse=True in the alt.Scale component.
alt.Chart(df).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred",reverse=True)),
tooltip = ["Artist","Song Name","Release Date","Chord"]
)
Spotify chart with tooltip¶
In the following chart we use a different color scheme, we specify the dimensions of the chart to make it a little bigger, and we give the chart a title.
alt.Chart(df).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color('Acousticness', scale=alt.Scale(scheme='turbo',reverse=True)),
tooltip = ["Artist","Song Name","Release Date","Chord"]
).properties(
width = 720,
height = 450,
title="Spotify dataset from Kaggle"
)
Caution
The rest of this notebook can be skipped on a first reading. We give some more advanced examples.
Histogram¶
Here is an example of how to make a histogram using Altair. The height of each bar indicates how many rows of the DataFrame fall in that category. The count() entry is not the name of a column. Instead it is a special Altair aggregation that counts how many rows belong to each category.
alt.Chart(df).mark_bar().encode(
x = "Artist",
y = "count()"
)
There are so many artists that this chart is pretty difficult to interpret. Let’s restrict ourselves to the top artists.
Here are the top 19 artists. (Why 19 rather than 20? No great reason, but this particular chart looks better with 19.)
top_artists = df.Artist.value_counts()[:19]
top_artists
Taylor Swift 52
Justin Bieber 32
Lil Uzi Vert 32
Juice WRLD 30
Pop Smoke 29
BTS 29
Bad Bunny 28
Eminem 22
The Weeknd 21
Drake 19
Ariana Grande 18
Billie Eilish 18
Selena Gomez 17
J. Cole 16
Doja Cat 16
Dua Lipa 15
Lady Gaga 14
Tyler, The Creator 14
DaBaby 14
Name: Artist, dtype: int64
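The reason slicing gives the top artists is that value_counts sorts its output from most frequent to least frequent. Here is a minimal sketch with made-up artist names (the Series `artists` is invented for illustration).

```python
import pandas as pd

# Made-up artist names: "X" appears 3 times, "Y" twice, "Z" once
artists = pd.Series(["X", "Y", "X", "Z", "X", "Y"])

counts = artists.value_counts()  # sorted with the largest count first
top2 = counts[:2]                # slicing keeps the first two rows

print(top2)
```

The index of the result holds the names and the values hold the counts, which is why top_artists.index gives the list of top artists.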
Let’s make our Altair chart using the sub-DataFrame with just these 19 top artists. We select the rows of this sub-DataFrame using a new pandas method, isin.
df_top = df[df.Artist.isin(top_artists.index)]
df_top.head()
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | Natural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | 3 | 16 | 2021-05-14--2021-05-21 | Kiss Me More (feat. SZA) | 29356736 | Doja Cat | 8640063.0 | 748mdHapucXQri7IAO8yFK | ['dance pop', 'pop'] | ... | 0.701 | -3.541 | 0.0286 | 0.23500 | 0.1230 | 110.968 | 208867.0 | 0.742 | G#/Ab | 0 |
| 8 | 9 | 3 | 8 | 2021-06-18--2021-06-25 | Yonaguni | 25030128 | Bad Bunny | 36142273.0 | 2JPLbjOn0wPCngEot2STUS | ['latin', 'reggaeton', 'trap latino'] | ... | 0.648 | -4.601 | 0.1180 | 0.27600 | 0.1350 | 179.951 | 206710.0 | 0.440 | C#/Db | 0 |
| 10 | 11 | 4 | 43 | 2021-05-07--2021-05-14 | Levitating (feat. DaBaby) | 23518010 | Dua Lipa | 27142474.0 | 463CkQjx2Zk1yXoBuierM9 | ['dance pop', 'pop', 'uk pop'] | ... | 0.825 | -3.787 | 0.0601 | 0.00883 | 0.0674 | 102.977 | 203064.0 | 0.915 | F#/Gb | 0 |
| 12 | 13 | 5 | 3 | 2021-07-09--2021-07-16 | Permission to Dance | 22062812 | BTS | 37106176.0 | 0LThjFY2iTtNdd4wviwVV2 | ['k-pop', 'k-pop boy group'] | ... | 0.741 | -5.330 | 0.0427 | 0.00544 | 0.3370 | 124.925 | 187585.0 | 0.646 | A | 1 |
| 13 | 14 | 1 | 19 | 2021-04-02--2021-04-09 | Peaches (feat. Daniel Caesar & Giveon) | 20294457 | Justin Bieber | 48504126.0 | 4iJyoBOLtHqaGxP12qzhQI | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.696 | -6.181 | 0.1190 | 0.32100 | 0.4200 | 90.030 | 198082.0 | 0.464 | C | 1 |
5 rows × 24 columns
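The isin method returns a Boolean Series that is True wherever the entry belongs to the given collection, and that Boolean Series can then be used to select rows. Here is a minimal sketch on a made-up DataFrame (the names `toy`, `keep`, and `mask` are invented for illustration).

```python
import pandas as pd

# Made-up DataFrame, for illustration only
toy = pd.DataFrame({
    "Artist": ["X", "Y", "Z", "X"],
    "Song": ["s1", "s2", "s3", "s4"],
})

keep = ["X", "Z"]
mask = toy["Artist"].isin(keep)  # True where the artist is in keep
sub = toy[mask]                  # keep only the rows where mask is True

print(mask.tolist())  # [True, False, True, True]
print(sub)
```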
alt.Chart(df_top).mark_bar().encode(
x = "Artist",
y = "count()"
)
Let’s add color to the chart, using the average number of Streams for each artist. In this example, mean is a special function in Altair, just like count.
Spotify bar chart¶
alt.Chart(df_top).mark_bar().encode(
x = "Artist",
y = "count()",
color = "mean(Streams)"
)
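For comparison, the quantities Altair computes for this chart can be sketched in pandas with groupby. The example below uses a made-up DataFrame (the names and numbers are invented for illustration). The count column corresponds to the bar heights, and the mean column corresponds to the colors.

```python
import pandas as pd

# Made-up data: artist "X" has two songs, artist "Y" has one
toy = pd.DataFrame({
    "Artist": ["X", "X", "Y"],
    "Streams": [100, 300, 50],
})

# For each artist: the number of rows and the average number of streams
summary = toy.groupby("Artist")["Streams"].agg(["count", "mean"])

print(summary)
```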
Exercise
Copy the above histogram code, and replace mean with sum. Suddenly the colors are less interesting. Why do you think that is?
Interactive example¶
We end with an example just for inspiration. One of the distinguishing features of Altair is its support for interactivity. If you click and drag on the chart below, the points in the region you select will gain color.
brush = alt.selection_interval(empty='none')
chart = alt.Chart(df).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.condition(brush,
alt.Color('Acousticness:Q', scale=alt.Scale(scheme='turbo',reverse=True)),
alt.value("lightgrey")),
).add_selection(
brush,
).properties(
width = 720,
height = 450,
title="Spotify dataset from Kaggle"
)
chart